~ chicken-core (master) /manual/Data representation


  1[[toc:]]
  2[[tags: manual]]
  3
  4== Data representation
  5
  6There exist two different kinds of data objects in the CHICKEN system:
  7immediate and non-immediate objects. 
  8
  9=== Immediate objects
 10
 11Immediate objects are represented by a single machine word, 32 or 64 bits depending on the architecture.  They come in four different flavors:
 12
 13'''fixnums''', that is, small exact integers, where the lowest order bit is
 14set to 1. This gives fixnums a range of 31 bits for the actual
 15numeric value (63 bits on 64-bit architectures).
 16
 17'''characters''', where the four lowest-order bits are equal to
 18{{C_CHARACTER_BITS}}, currently 1010. The Unicode code point
 19of the character is encoded in the next 24 bits, from 0x00 to 0x1FFFFF.
 20Code points from 0xDC80 to 0xDCFF are not allowed and used to
 21mark invalid UTF-8 bytes in strings (see below).
 22
 23'''booleans''', where the four lowest-order bits are equal to {{C_BOOLEAN_BITS}},
 24currently 0110. The next bit is one for #t and zero for #f.
 25
 26'''other values''': the empty list, the value of unbound identifiers,
 27the undefined value (void), and end-of-file.  The four lowest-order bits are equal to
 28{{C_SPECIAL_BITS}}, currently 1110.  The next four bits contain an identifying
 29number for this type of object, one of:
 30{{C_SCHEME_END_OF_LIST}}, currently 0000;
 31{{C_SCHEME_UNDEFINED}}, currently 0001;
 32{{C_SCHEME_UNBOUND}}, currently 0010; or
 33{{C_SCHEME_END_OF_FILE}}, currently 0011.
 34
 35=== Non-immediate objects
 36
 37Collectively, the two lowest-order bits are known as the ''immediate mark bits''.  When the lowest bit is set, the object is a fixnum, as described above, and the next bit is part of its value.  When the lowest bit is clear but the next bit is set, it is an immediate object other than a fixnum.  If neither bit is set, the object is non-immediate, as described below.
 38
 39Non-immediate objects are blocks of data represented by a pointer into
 40the heap.  The pointer's immediate mark bits must be zero to indicate the object is non-immediate;
 41this guarantees the data block is aligned on a 4-byte boundary, at minimum.  Alignment of data words
 42is required on modern architectures anyway, so we get the ability to distinguish between immediate and non-immediate objects for free.
 43
 44The first word of the data block contains a header, which gives
 45information about the type of the object. The header is a
 46single machine word.
 47
 48The 24 (56 on 64-bit systems) lowest-order bits contain the length of
 49the data object, which is either the number of bytes in a string or
 50byte-vector, or the number of elements for a vector or record type. This
 51allows a maximum size for string or byte-vectors of 2^24 bytes, or
 52approximately 16 MB, on 32-bit systems, and 2^56 bytes, or approximately
 5372 PB, on 64-bit systems.
 54
 55The remaining bits are placed in the high-order end of the header.
 56The four highest-order bits are used for garbage
 57collection or internal data type dispatching.
 58
 59; C_GC_FORWARDING_BIT : Flag used for forwarding garbage collected object pointers.
 60
 61; C_BYTEBLOCK_BIT : Flag that specifies whether this data object contains raw bytes (a bytevector) or pointers to other data objects.
 62
 63; C_SPECIALBLOCK_BIT : Flag that specifies whether this object contains a ''special'' non-object pointer value in its first slot. An example for this kind of objects are closures, which are a vector-type object with the code-pointer as the first item.  This is also used to turn a pair's car into a weak reference in the symbol table, to allow its symbol to be collected.
 64
 65; C_8ALIGN_BIT : Flag that specifies whether the data area of this block should be aligned on an 8-byte boundary (floating-point numbers, for example).
 66
 67After these four bits comes a 4-bit type code representing one of the following types:
 68
 69'''vectors''': vector objects with type bits {{C_VECTOR_TYPE}}, currently 0000.
 70
 71'''symbols''': vector objects with type bits {{C_SYMBOL_TYPE}},
 72currently 0001. The three slots contain the toplevel variable value,
 73the print-name (a string), and the property list of the symbol.  When
 74manipulating these slots, the symbol table's container needs to be
 75manipulated as well, to control garbage collection of the symbol: if
 76the symbol is undefined and has no property list, the symbol table's
 77container should be a weak reference ({{C_WEAK_PAIR_TYPE}}), otherwise
 78it should be a strong reference ({{C_PAIR_TYPE}}).
 79
 80'''strings''': objects with type bits {{C_STRING_TYPE}}, currently 0010,
 81holding a bytevector with the string contents represented as an UTF-8 encoded 
 82byte sequence. Note that invalid UTF-8 sequences are allowed inside strings,
 83extracted characters at positions that contain invalid UTF-8
 84sequences are represented as the trailing surrogate pair half (U+DC00|xx).
 85
 86'''pairs''': vector-like object with type bits {{C_PAIR_TYPE}}, currently 0011.
 87The car and the cdr are contained in the first and second slots,
 88respectively.
 89
 90'''closures''': special vector objects with type bits
 91{{C_CLOSURE_TYPE}}, currently 0100. The first slot contains a pointer to a
 92compiled C function. Any extra slots contain the free variables (since
 93a flat closure representation is used).
 94
 95'''flonums''': byte-vector objects with type bits
 96{{C_FLONUM_BITS}}, currently 0101. Slots one and two (or a single slot on
 9764 bit architectures) contain a 64-bit floating-point number, in the
 98representation used by the host system's C compiler.
 99
100'''bignums''': special vector objects with type bits
101{{C_BIGNUM_TYPE}}, currently 0110.  This contains only one slot, which
102points to a bytevector that contains the number's limbs in a special
103format: The first word contains a 1 if the number is negative, 0 if it
104is positive.  The remaining words form the bignum's limbs.  A
105bytevector is used because the limbs are stored in the raw machine
106format, which would be invalid Scheme objects when viewed as slots in
107a vector.
108
109'''ports''': special vector objects with type bits
110{{C_PORT_TYPE}}, currently 0111. The first slot contains a pointer to a file-
111stream, if this is a file-pointer, or NULL if not. The other slots
112contain housekeeping data used for this port.
113
114'''structures''': vector objects with type bits
115{{C_STRUCTURE_TYPE}}, currently 1000. The first slot contains a symbol that
116specifies the kind of structure this record is an instance of. The other
117slots contain the actual record items.
118
119'''bytevector''': a raw sequence of bytes with type bits {{C_BYTEVECTOR_TYPE}}.
120
121'''pointer-vectors''': vector objects of native pointers - these are
122actually structures where the first slot holds a bytevector containing the 32- or 64-bit
123pointer values.
124
125'''locatives''': special vector objects with type bits
126{{C_LOCATIVE_TYPE}}, currently 1010.  A locative object holds 4 slots:
127a raw pointer to the location inside the object referred to by the locative,
128the offset in bytes from the start of the object referred to, the type of
129the location (whether it refers to an unboxed numeric value or a normal
130object slot that holds a pointer to Scheme data) and a flag indicating
131whether this locative is "weak". If the locative is non-weak, slot #4 holds
132a pointer to the object referred to.
133
134'''pointers''': special vector objects with type bits
135{{C_POINTER_TYPE}}, currently 1001. The single slot contains a machine pointer.
136
137'''tagged pointers''': special vector objects with type bits 
138{{C_TAGGED_POINTER_TYPE}}, currently 1011, Tagged pointers are similar to pointers,
139but the object contains an additional
140slot with a tag (an arbitrary data object) that identifies the type
141of the pointer.
142
143'''ratnums''': vector-like objects with type-bits {{C_RATNUM_TYPE}},
144currently 1100.  The first slot contains the numerator (which can be
145positive or negative), the second slot contains the denominator, which
146is always positive.  These numbers are always simplified, so their gcd
147will always be 1.
148
149'''lambda infos''': byte-vector objects with type-bits {{C_LAMBDA_INFO_TYPE}}, currently 1101.
150
151'''cplxnums''': vector-like objects with type-bits {{C_CPLXNUM_TYPE}},
152currently 1110. The first slot contains the real part, the second slot
153contains the imaginary part of the complex number.  These two numbers
154are of matching exactness: Either both are flonums or none are.
155
156The actual data follows immediately after the header. Note that
157block addresses are always aligned to the native machine-word
158boundary.
159
160Data objects may be allocated outside of the garbage collected heap, as
161long as their layout follows the above mentioned scheme. But care has to
162be taken not to mutate these objects with heap-data (i.e. non-immediate
163objects), because this will confuse the garbage collector.
164
165For more information see the header file {{chicken.h}}.
166
167
168---
169Previous: [[C interface]]
170
171Next: [[Modules]]
Trap